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Abstract 

Motivation: Detecting local correlations in expression between neighbor genes 
along the genome has proved to be an effective strategy to identify possible causes 
of transcriptional deregulation in cancer. It has been successfully used to illus¬ 
trate the role of mechanisms such as copy number variation (CNV) or epigenetic 
alterations as factors that may significantly alter expression in large chromosomic 
regions (gene silencing or gene activation). 

Results: The identification of correlated regions requires segmenting the gene ex¬ 
pression correlation matrix into regions of homogeneously correlated genes and 
assessing whether the observed local correlation is significantly higher than the 
background chromosomal correlation. A unified statistical framework is proposed 
to achieve these two tasks, where optimal segmentation is efficiently performed 
using dynamic programming algorithm, and detection of highly correlated regions 
is then achieved using an exact test procedure. We also propose a simple and ef¬ 
ficient procedure to correct the expression signal for mechanisms already known 
to impact expression correlation. The performance and robustness of the proposed 
procedure, called SegCorr, are evaluated on simulated data. The procedure is illus¬ 
trated on cancer data, where the signal is corrected for correlations possibly caused 
by copy number variation. The correction permitted the detection of regions with 
high correlations linked to DNA methylation. 

Availability and implementation: R package SegCorr is available on the CRAN. 
Contact: eldelatola@yahoo.gr 
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1 Introduction 


In the last decade, the study of local co-expression of neighbor genes along the chro¬ 
mosome has become a question of major importance in cancer biology ( [Clark[|2007| . 
The development of’’Omics” technologies have permitted the identification of several 
mechanisms inducing local gene regulation, that may be due to a common transcrip¬ 
tion factor (|De and Babul 2010| > or common epigenetic marks (Stransky et al. 2006 


Frigola et al. 2006). Copy number variation due to polymorphism or to genomic insta¬ 


bility in cancer is also a possible cause for observing a correlation between neighbor 
genes (Aldred et al. 2005[ >, as their expressions are likely to be affected by the same 
copy number variation (CNV). It has further been observed that local regulations may 
occur in specific nuclear domains, as the nuclear region is an environment which may 
favor or not transcription (Bickmore 20131. 


Investigating the impact of a specific source of regulation (TF, DNA methylation, 
histoine modifications, CNV) on the expression has now become a common practice 
for which statistical tools are readily available. On the other hand, only a few methods 
have been proposed to focus on the direct analysis of gene expression correlation along 
the chromosomes. The direct analysis of correlations may have different purposes: 


( i ) one can aim at detecting all potential chromosomal domains of co-expression, 
then investigating to which extend known causal mechanism are responsible for 
the observed co-expression patterns, 

(ii) one can aim at detecting chromosomal domains of co-expression where correla¬ 
tions are not caused by already known sources of regulation, in order to identify 
new potential mechanisms impacting transcription. 


Addressing problems (i) and (ii) is crucial to fully understand transcriptional deregu¬ 
lation and/or to model gene regulation. 

We first consider problem (i) and provide a precise definition of our purpose: one 
aims at identifying correlated regions, i.e. blocks of neighbor genes, the expression of 
which display correlations across patients that are significantly higher than expected. 
Indeed, it has been observed that background correlation between adjacent genes along 
the genome does exist. This should not be confounded with the co-expression that can 
be locally observed due to the aforementioned mechanisms. Note that this definition 
does not include detection of regions with biased gene expression as studied in Nilsson 
et al. ( 2008|>, Xiao et aI7|( 2009|> or Seifert et al. (|2014[>, that account for potential corre¬ 


lations between adjacent genes but do not aim at detecting highly correlated regions. 

Several approaches have already been proposed to tackle problem ( i ). Clustering 
methods accounting for the chromosomal organization of the genes have been proposed 
in CluGene ( |Dottorini et al.[ |2013| l and DIGMAP ( |Yi et ah 2005!. Sliding windows 
procedures have also been considered, such as G-NEST ( Lemay et al.[ 2012| >, REEF 
( Coppe et al.| 2006[ > or TCM (Reyal et al. 2005 1 >. The principle is then to compute 
correlation scores for genes falling within the window, then to detect local peaks of 
high correlation scores. While these procedures have been successfully applied to can¬ 
cer data, all tackle the detection of correlated region using heuristics. As such, they 
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suffer from classical limitations associated with these techniques, including local opti¬ 
mum (for clustering algorithms) or detection instability according to the choice of the 
window size (for sliding windows). 

It is now well known that the problem of finding regions in a spatially ordered sig¬ 
nal can be cast as a segmentation problem, for which standard statistical models exist, 
along with efficient algorithms to find the globally optimal solution. According to our 
definition, the detection of correlated regions boils down to the block-diagonal seg¬ 
mentation of the correlation matrix between gene expressions. Such an approach has 
been proposed in image processing ( Lu et al.) 2013[ >, finance ( |Lavielle and Teyssiere[ 
2006) and bioinformatics for CNV analysis (Zhang et al. 2010|), but to the best of our 


knowledge it has never been considered for the detection of correlated expression re¬ 
gions. 


While problem ( i ) can be addressed on the basis of only expression data, problem 
(ii) requires the additional measurement of the signal one needs to account for. For 
example, consider that one seeks for locally expressed co-regulation events that are not 
due to copy number variations to include copy number changes due to polymorphism 
or genomic alteration observed in cancer. The strategy we adopt here consists in first 
correcting the expression data for potential CNV regulation, then in applying the pro¬ 
cedure described to solve problem (i) on the corrected signal. The corrected signal 
is obtained by regressing the initial expression signal on the CNV signal. Although 
quite simple, the strategy turns out to be efficient in practice. An alternative strategy 
would be to jointly model both the expression and the signals to correct for, and then 
propose within this framework a correction. Such a strategy would necessitate to adapt 
the modeling to the specific combination of signals one has at hand. In comparison, 
the regression procedure proposed here can be applied to any kind and any number of 
signals one needs to correct for. 


The outline of the present article is the following. In Section[2]we propose a para¬ 
metric statistical framework for the problem of correlated region identification. Finding 
regions of co-regulated genes can then be achieved by maximum likelihood inference 
(to find the boundaries of each region along with their correlation levels). An exact test 
procedure to assess the significance of the correlation with respect to background corre¬ 
lation is proposed in Section [3] We introduce a simple procedure to correct expression 
data beforehand for some known (and quantified) source of correlation. Because the 
background correlation level is a priori unknown, an estimator of this quantity is also 
proposed. The performance of the SegCorr procedure is illustrated in Section |4] on 
simulated data, along with a comparison with the TCM algorithm proposed in |Reyal| 
et al. ( 2005|. Finally, a case study on cancer data is presented in Section[5] 
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2 Correlation matrix segmentation 


2.1 Statistical model 

We consider the following expression matrix: 


Y n ■ 

■■ Y 1P 

y 21 ■ 

•• Y 2p 

Y n 1 ■ 

Y 

1 n P. 


where Y- r) stands for the expression of gene j (j = 1,... ,p) observed in patient i (i = 

1 .. .. ,n). The i-th row of this matrix is denoted Y, and corresponds to the expression 
vector of all genes in patient i. In order to detect regions of correlated expression, 
we consider the following statistical model. Profiles { Y, } i <,<„ are supposed to be 

1.1. d, normalized (centered and standardized), following a Gaussian distribution with 
block-diagonal correlation matrix G: 



1 • •' Pk 

withS fc (1) 

Pk ■■■ 1. 

The model states that genes are spread into I\ contiguous regions, with respective 
lengths Pk(k = 1 ,K, J2i<k<K Pk = P)- Genes belonging to different regions are 
supposed to be independent, whereas genes belonging to a same region are supposed 
to share the same pairwise correlation coefficient ftp-. This amounts to assume that 
some specific effect (e.g. methylation) affects the expression of all genes belonging 
to the region. More specifically, let Up- denote the vector of the region effect (accross 
patients). For all genes j from region k, the model can be written as Yp = p i: j + 
Uik + Epj. The error terms U, :l are all independent and independent from U-,k such that 

v(E/i*)/v(y y ) = Pk . 

2.2 Accounting for known sources of regulation 

As mentioned in the Introduction, a second task (ii) can be to detect correlated regions 
which are not due to an already known mechanism. While in the model of the previous 
section, Y l:) denoted the expression of gene j for patient i. In this case, Y, :l refers to the 
gene expression corrected for some underlying mechanism. This correction step can 
be done using the following regression model: 

Ypj /3q T (| xpj T cpj , (2) 


4 



where x, t j stands for the covariate observed in patient i for gene j. For instance, in the 


illustration of Section 5.4 Xjj is the copy number associated to patient i at location 


of gene j. The corrected signal is then Y tJ = Yij - Bo - Bi . Note that it suffices 
to assume that (e^) are independent among patients (but not among genes) to get the 
standard linear regression estimates (see [Anderson ( T958| , Chapter 8). 

Several articles have investigated the relationship between gene expression and 
mechanisms such as CNV or methylation, and proposed sophisticated models for this 
relationship (Menezes et al. 2009[ van Wieringen et al.[ 2010||Leday et al. 20131. Al¬ 
ternatively, gene expression can be predicted using one of these models. Then, these 
predictions can replace the explanatory variable x in Equation 0. The residuals of the 
resulting model can be considered as the corrected signal. 


2.3 Inference of correlated regions 

Parameter inference in Model {T} amounts to estimate the number of regions K, the 
region boundaries 0 = To < t\ < ■ ■ ■ < tk +i = p, and the correlation parameters 
Pi,px within each of these regions. Here, we consider a maximum penalized like¬ 
lihood approach. First, we show that for a given K the optimal region boundaries and 
correlation coefficients can be efficiently obtained using dynamic programming. The 
number of regions can then be selected using a penalized likelihood criterion. 

For a fixed K, the estimation problem can be formulated as follows: 


arg max max C (3) 

ti<—<tk Pi, — ,pk 

where the log-likelihood C is expressed as 

-2 C = ttlog |G| + tr [YG~ 1 (Y) t ] .. 

Here, thanks to the block diagonal structure of the correlation matrix in Model (|T]), the 
log-likelihood can be rewritten as 


-2C = 


K 

{ nl °s 


Sfcl + tr 


y(*)£ 


-i 


k =1 


(y (fc) ) T J } 


where Y^ stands for the set of expression from Y corresponding to genes included in 
the £>th region. Moreover defining the log-likelihood in region k containing genes 
from Tfc_ i + 1 to tx. as 


— 2 Ck 


-2£(r fe _i + 1,7*) 


ft log |£fc| + tr 


y( fc )j]- 1 (y( fc )) T 


the optimization problem ([3]) boils down to 


K 

arg max > ma xCk- 

Ti<--‘<T K ' Pfc 

k= 1 


(4) 

(5) 


( 6 ) 
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Inference when K is fixed We first show that for a given region k with known bound¬ 
aries, explicit expressions can be obtained for both the ML estimator pk and the likeli¬ 
hood Ck at the optimum: 


Lemma 1 The maximum of Ck with respect to pk is reached for 


Pk = 


e r Ei k G Jk ~p k 

pi - ^ 


(7) 


where Gjk •= n 1 YijYik- Furthermore, the maximal value of Ck is given by 

,r \ , , ^ ( E? Ei k G jk - Pl\ 

V Pk-Pk J 

Gjk 

Pk 

The proof is given in Appendix[A] The expression of Problem (|6]i is 

K 

arg max \ Ck 

T\<---<Tk ^ 


( 8 ) 


(9) 


fc=l 


which is additive with respect to the Ck terms that can be straightforwardly computed 
thanks to Lemma |T] Consequently, optimization can be performed via Dynamic Pro¬ 
gramming (DP, Lavielle ( |2QQ5| >, [Picard et ak ( |2005 i). The optimal boundaries, and 
correlation estimators can be obtained at computational cost 0(Kp 2 ). 

Lasso-type approaches have been proposed to tackle segmentation problems in a faster 
way (see e.g. |Tibshirani and Wang] ( )2008| ). First, note that such methods rely on a 
relaxation of the original problem, so that the result may be different from the exact so¬ 
lution of problem (j8). Furthermore, as for matrix segmentation, such approaches have 
been proposed ( |Bien ancTT ibshirani (2011!; Levina et al. (2008 )), which do not allow 
to capture the longitudinal structure (i.e. blocks of neighbor genes). 

Model selection. To choose the number of regions, we adopt the model selection 
strategy proposed in Lavielle ( 2005| . For each 1 < K < /(' Ili;ix , we define the maximal 
log-likelihood for K regions as 


K 


L k = max V £(r fc _i + 1, r k ) . 

r 1 <---<r K 


k= 1 


Furthermore, the normalized log-likelihood is defined as 

Lk — Lk 

1Y max x\. . 


Lk = 


Lk — L i 

1 v max J 


■(7v max — Ai) + 1, 
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where Kj = 5 x j + 2 x j log (p/j) is the penalty function. Lavielle (2005) suggests to 


estimate the number of regions I\ as the value of K such that L k displays the largest 
slope change. Namely, we take 

K = argrnin {(Z*- - L K+ 1 ) - (L k +i - L K + 2 ) > S'} , 

where the value of threshold S is predefined. Throughout the paper, S = 0.7 as sug¬ 
gested in |Lavielle ( |2005 1 . The robustness of the results with respect to other values for 
threshold S is investigated in Section [4] This global approach (dynamic programming 
and model selection) has been applied with success for CNV detection (see Picard et al. 
( |2005j ) and |Lai et al.| ( j2005| > for a comparative study.) 

3 Assessing correlation significance 


It has been observed (|Cohen et al. 2000; Spellman and Rubin, 2002 1 Reyal et al. 2005 


Stransky et al. 2006) that background correlations may exist between adjacent genes 


along the genome, i.e. one expects the correlation level in any region to be positive. 
As a consequence, one has to check whether a given region exhibits a correlation level 
that is significantly higher than the background correlation level p 0 , that is observed by 
default. 


Test procedure. Once the correlation matrix segmentation is performed, it is possible 
to identify regions with high correlation levels by testing Hg '■ Pk = Po vs H\ : p k > 
p (} . This can be done using the following test statistic for region k: 


T k = 



l 


where 


T fc n 

y£ ] - E Y a and Y " = E F /. fc) • 

j=T k - 1+1 t= 1 

Assuming Model © is true, test statistic Tk has distribution 

T k ~ A {pk,Pk)Xn-i where A (j>k,Pk) = ^ + < ' Pfc -• 

npk 

Here Xn-i stands for the chi-square distribution with n — 1 degrees of freedom. The 
proof is given in Appendix [B] We emphasize that this test is exact and does not rely on 
any resampling strategy. 

Consequently, the p-value associated to region k is given by 

P(\(p k ,p 0 )Z>TZ bs ) , where Z ~ xl-i- 
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Statistical power. We now study the ability of the proposed test to detect a region 
with width p where the correlation p is higher than in the background. The probability 
to detect such a region depends on both p and p, and writes 

P°(n,p,p) = 


where Z ~ Xn-i an d Qn-ip-ct is the 1 — a quantile for the Xn-i distribution. Fig¬ 
ure |T] (Top) displays the evolution of power for different values of p and p. Here p 0 
and n are fixed at 0.15 and 58, respectively, which correspond to the values observed 
in the reference dataset (see Section [5]). The nominal levels of a are 5%, 0.5% and 
0.05%. These correspond to realistic thresholds, once multiple testing corrections such 
as Bonferroni or FDR are performed. One can observe that even for small values of 
p, the power is high whatever the nominal level as soon as the number of genes in the 
considered region is equal to or higher than 5. It also shows that the procedure will 
probably fail to find regions of size 3, if the correlation is not at least as high as 0.7 (to 
obtain a power of 0.8). On the same graph (Bottom), one observes that a sample of size 
50 is sufficient to efficiently detect regions of size 5, as long as the correlation is higher 
than 0.6. Larger samples will be required if one wants to efficiently detect regions with 
smaller correlation levels. 


Pr {T > \(jp,p 0 )q n -i tl - a } 


Pr i Z > 


Hp,Po) . 

Hp,p) 


'Qn— 1,1—a 


Background correlation estimation. The test procedure requires the knowledge of 
parameter that is unknown in practice. However, it can be estimated using 

Po = |median(corr(F i _i,F i ))| . (10) 

i>l 

Under the assumption that most pairs of adjacent genes display a po correlation, i.e. 
only a few number of regions with moderate sizes exhibit a high level of correlation, 
po is a robust estimator of the background correlation. The behavior of estimator ( [T()| 
is investigated in Section[4] 

4 Simulation study 

In this section, we first study the quality of the proposed estimator of po- Then we 
study the ability of SegCorr to detect correlated regions and compare its performance 
with this of TCM algorithm. The robustness of the method with respect to the choice 
of the model selection threshold S will be investigated in Section [5?2| on real data, since 
very little difference were observed on the simulated data (results not shown). 


4.1 Simulation design 

For each round of simulation, a sample of n = 58 profiles with 22 chromosomes each 
is generated, with a number of genes per chromosome identical to the one observed in 
the reference dataset (see description in Section 5.1 1 . Each chromosome is split into 
K c hr regions, according to the segmentation obtained in Stransky et al. (2QQ6)>. Each 
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Figure 1: Top: Power curves as a function of p, for a fixed cohort size n = 58 and 
varying region width p = 3,5,10, 20. Bottom: Same graphs for a region of fixed width 
p = 5 but varying cohort sizes n = 10, 50, 200,1000. In all graphs po is fixed at 0.15. 
The nominal level a of the test is set to 5% (left), 0.5% (center), 0.05% (right) 
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Figure 2: Left: Length of II \ regions in the reference dataset. Right: Distribution 
of the backgroung correlation po obtained from the reference data according to the 
segmentation obtained in Stransky et al. (2006)>. 


segmentation alternates between Ho regions, i.e. regions with nominal background 
correlation po, and Hi regions with nominal correlation p\. Figure[2]depicts the length 
of the Hi correlated regions obtained in the Stransky et al.[( j2006| i study. As it can 
be seen, this length varies substantially from a H\ region to another. For pi, different 
values were considered (ranging from 0.3 to 0.9), all Hi regions of all chromosomes 
sharing this same p\ coefficient. For po, two cases are considered : 

• Scenario 1 (Easy case): po is the same for all Hq regions of all chromosomes (3 
different values considered: 0.08, 0.18, 0.28). 

• Scenario 2 (Realistic case): po is identical for all // (l regions within a chro¬ 
mosome, but varies from a chromosome to another. The specific value chosen 
for a given chromosome is the one obtained for this same chromosome on the 
reference dataset. The distribution of p 0 across chromosomes on the reference 
dataset is given in Figure [2] 

Additionally, correlations between genes from different regions were observed on the 
reference dataset. Thus, a similar extra block diagonal correlation pattern was gener¬ 
ated in the simulations, with a level of background correlation between blocks similar 
to the one within blocks. 

Lastly, for each combination (p 0 , p\) the simulation was replicated 20 times. 

4.2 Quality of the p 0 estimator 

For this study, we consider Scenario 1. Figure[3]illustrates the estimation accuracy of po 
under different levels of both // () and II \ correlations on chromosome 3. Estimator ( [T()| 
yields in over-estimated values of the true background correlation level. One observes 
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Figure 3: Left: estimation of p 0 for chromosome 3 under different levels of both H 0 
and Hi correlations (po = 0.08,0.18 and 0.28). Dashed lines indicate the true po- 
Right: estimation of po for po = 0.18 and different levels of Hi correlations according 
to the fraction of Hi correlations (the results are showed for five typical chromosomes 
only). 


that the overestimation does not depend on the correlation level in Hi regions, thanks 
to the use of the median. Still, as expected, it is linked to the proportion of pairs of 
adjacent genes with Hi correlations, as showed in Figure [3] Importantly, while over¬ 
estimation of po will result in a decrease of power, it will not increase the false positive 
rate (FDR or FWER). 

4.3 Performance evaluation 

To assess the performance of SegCorr, the true positive rate (TPR = sensitivity), false 
positive rate (FPR = 1— specificity) and area under the ROC curve (AUC) were con¬ 
sidered. These criteria were first computed at the gene level. However, as the goal 
is to identify correlated regions, a definition of TPR and FPR at the region level was 
adopted. We considered the intersection between the true and the estimated segmenta¬ 
tions and computed the number of true/false positive/negative regions. This amounts at 
classifying each gene into one of four status (true/false x positive/negative) and then 
to merge neighbor genes sharing a same status into regions. The status of a region is 
given by the status of its genes. Consequently, criteria computed at the region level are 
more stringent as they measure the precision of region boundary estimation. 

Figure [4] shows the AUC for the first simulation scenario under various configura¬ 
tions, with pi fixed at 0.5. When po is between 0.08 and 0.18, most regions are cor¬ 
rectly detected. For po = 0.28 (a value higher than what is observed on the reference 
dataset, see Figure |2j, the task becomes difficult and the performances deteriorate. 

On the second simulation study, the behavior of SegCorr was explored under dif- 
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Figure 4: AUC at the gene level (red) and region level (blue). The higher the AUC the 
better. Left: Simulation design 1 with fixed pi = .5 (a>axis: po). Right: Simulation 
design 2 (s-axis: pi). 


ferent p\. Obviously the task becomes easier when (>\ gets larger. Figure [4] shows 
that SegCorr performs well when 0.5 < pi < 0.9. When p\ < 0.5, (remind that the 
background correlation can be as high as 0.2, see Figure |2| although the performances 
remain good at the gene level, the boundaries of the regions are detected less accurately. 


4.4 Comparison with the TCM algorithm 

SegCorr was compared with the TCM algorithm introduced by [Reyal et al.| ( j2005j ) for 
the detection of local correlations. In the literature, many methods tackling the same 
problem as the TCM have been proposed. The choice of the TCM as a competing 
method was based on the availability of the code. Figure [5]displays the AUC achieved 
by SegCorr and TCM under Scenario 1. When po is large (p () = 0.28), one observes 
that the mean performance of both methods are comparable with higher variability for 
SegCorr at the gene level and at the region level for TCM. Since the aim is to detect 
regions rather that genes, the SegCorr procedure seems more appropriate. For small or 
medium values of background correlations (po = 0.08, 0.18) SegCorr achieves better 
AUC than TCM at both the gene and the region levels. As a conclusion, SegCorr 
appears to be a more consistent and efficient procedure to detect correlated regions. 

5 Bladder cancer data 


5.1 It is now well 


In this section, we apply SegCorr on the dataset described in Section 
known that copy number variation (CNV) impacts gene expression ( |Sebat et ah 2004| >. 
Here our goal is to detect regions where the correlation is not due to CNV. Therefore 
we correct the expression signal for CNV variation according to the strategy described 
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Figure 5: AUC of the SegCorr (red) and TCM (blue) algorithms for the simulation 
scenario 1 as a function of po■ Left: gene level. Right: region level. 


in Sections |2.1| and |5.3| The effect of this correction is investigated in Section |5.4| 
Lastly, Sec tion[575| illustrates the biological results obtained after correction for CNV. 


5.1 Data presentation 

The dataset consists of n = 58 bladder tumors. Gene expression were measured using 


exon 1.0 Affymetrix arrays and RMA normalisation Irizarry et al. (2003) was applied. 
The number of genes per chromosome ranges from a 293 to 2192 (with average 950). 
Additionally genomic and methylation data were collected for the same tumors. Ge¬ 
nomic data were obtained with Illumina Human610-Quad SNP arrays and methylation 
data with Illumina Human methylation 450k arrays. For the latter, the normalization 
procedure proposed by Teschendorff et al. ( 2013[ > was used. The combination of probe 
positions was made according to the custom BrainArray EntrezGene annotation (Dai 
[et~aL||2005 ). 


5.2 Study of the model selection threshold S 

For the model selection criterion (see section [273] ), the threshold S must be tuned in 
such a way to avoid under/over-segmentation. The smaller the value of S the higher 
the number of segments. As stated in Section 2.3 S was fixed to 0.7 as advocated in 


Lavielle ( 2005) >. Figure[6]shows the evolution of the number and location of II \ regions 
detected by SegCorr according to S on a typical chromosome (chromosome 3). One 
can see that most of these Hi regions are stable for values of S between 0.6 and 0.9. 
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Figure 6: Left: statistically significant regions in red obtained for different values of 
S. The vertical lines correspond to the ones obtained with the default value of S we 
considered (S = 0.7). Right: number of statistically significant regions for different 
values of S. The dotted vertical red line corresponds to S = 0.7. 


5.3 Procedure for CNV correction 

To correct the expression signal from CNV, one first needs to detect the CNV regions 
from the SNP signal. To this aim, we consider the segmentation method proposed by 
Picard et al.| ( |20lf) implemented in the R package cghseg. Denote SNPi t the SNP 
signal of patient i at position / , the model writes 

SNP lt = ^ k + E it if t e I\ = [4_! + 1, 4]. (11) 

where the P lt are i.i.d centered Gaussian with variance a 2 . The method estimates the 
number of regions, the boundaries of the regions, denoted t l k and the signal mean within 
each region k in patient /, denoted fi t k. 

We then use the regression model 0 to make the correction where Xij is the mean 
obtained previously if the SNP position t\ corresponds to gene j of the expression 
signal in patient i. Since the SNP and expression signals are not aligned, there might 
be either one, many or no SNP probes that belong to the corresponding gene region. 
We then propose to define x 13 as follows : if one or many probes are related to gene j, 
mean fi t k or the average of the different means is considered respectively; if there is no 
probe, a linear interpolation is performed. 

5.4 CNV Dependent Region 

We first investigate the effect of CNV correction (described in Section [53] ) by compar¬ 
ing the results obtained on the raw and corrected signals. Figure[7]displays the number 
of significant Hi regions as a function of the test level o for both the raw and corrected 
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Chrom. Genes 


2 ASAP2, ITGB1BP1, CPSF3, IAH1, ADAM 17, YWHAQ, TAF1B 
6 MBOAT1, E2F3, CDKAL1, SOX4, LINC00340 
12 MDM2, CPM, CPSF6, LYZ, YEATS4, FRS2, CCT2 
16 COG7, GGA2, EARS2, UBFD1, NDUFAB1, PALB2, DCTN5 

Table 1: Four examples of CNV-dependent regions 


signals. For small values of a (which are typically used for testing significance), the 
number of detected regions are quite similar. However, only one third of the detected 
genes are common, meaning that the regions detected with the two signals are quite 
different. Furthermore, as the correction remove all effects due to CNV, the estimated 
background correlation is lower in the corrected signal than in the raw signal (mean 
decrease across all chromosomes of .02). This makes the test we propose more pow¬ 
erful and explains why, while CNV-due regions are removed, the number of detected 
regions for given a remains about the same. 

To illustrate this phenomenon more precisely, we considered a set of four regions 
in chromosomes 2, 6, 12 and 16 known to be associated with CNV in bladder cancer 
(Heidenblad et al. 2008 1 Network et al. 2014[ >. These regions, given in Table [T] are 
detected by SegCorr when applied to the raw expression data. When considering the 
corrected signal, these regions are not detected any more. For example, when consid¬ 
ering the region in chromosome 2, the background correlation was po = 0.163 and 
the correlation within this region was pk = 0.561, resulting in a highly significant p- 
value: 2.1e-7. After correction we get p 0 = 0.132 and pk = 0.284, which results in a 
non-significant p-value: 2.5e-2. 

More generally, over the 184 regions solely detected on the raw signal with p-value 
smaller than 5% (before multiple testing correction), more than a half (98) get non 
significant when considering the corrected signal. This explains a substantial part of 
the difference between the regions detected on raw and corrected signals. This also 
shows that the proposed CNV correction strategy performs reasonably well. 


5.5 CNV-independent regions 

General description. When applied to the CNV-corrected expression signal, SegCorr 
detected 569 significant regions (p-value adjusted < 0.05) which are distributed through¬ 
out the genome (an average of 23 regions per chromosome). Among these regions, 
158 regions contained well known gene family clusters such as the HOXA, HOXB, 
HOXD clusters, several KRT clusters, the epidermal differentiation complex, and HLA 


gene families clusters whose expression is known to be co-regulated (Sproul et al. 


2005[ >. We next undertook a Gene Ontology terms analysis and identified an enrichment 
of genes belongs to the keratinization pathway (p-value 4.09E-19 and FDR (/-value 
9.01E-16). The expression of this pathways is strongly associated with a subgroup of 
bladder cancer called basal-like bladder cancer ( |Rebouissou et al.||2014[ ). 11/28 CNV 
independent regions detected by TCM using different platforms (Affymetrix U95A for 
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Figure 7: Left: Number of statistically significant regions as a function of a (solid 
line: corrected signal, dotted line: raw signal). Right: proportion of significant genes 
common in the two signal as a function of a. 


the transcriptome and BAC arrays for the genomic alterations) were also detected by 
SegCorr ( jStransky et al.[|2006| . 


An example of epigenetic region. We now present a region where the observed cor¬ 
relation is not due to CNV but can be associated with an epigenetic mark. When applied 
to the CNV corrected expression data, SegCorr detects a region of four genes (HOXB3, 
HOXB-AS3, HOXB5, HOXB6: p k = 0.93, p-value = 3.1e-8) in chromosome 17. This 
region has already been studied by |Vallot et al. ( 20TT[ i and has been referred to as 17-7. 
Figure [8] (left) shows a clear pattern in the expression data, which is detected by 
SegCorr. The right panel provides the DNA methylation data for the same region, 
which also depicts a clear pattern. This suggest that this region is silenced by an epi¬ 
genetic mechanism associated with DNA methylation. 

This statement is supported by Figure [9] which depicts the correlation between each 
gene expression and the methylation signal at each locus within the region. It shows 
a large proportion methylation sites being negatively correlated with the expression 
of the four genes. This indicates that high DNA methylation level in this region is 
associated with the silencing of these genes. 


6 Discussion 


The identification of co-regulated chromosomal regions has many implications in biol¬ 
ogy. In this paper, we developed a method to identify these regions and we applied it 
to cancer data. The method relies on a formal definition of what correlated regions are. 
It takes advantage of an efficient algorithm and a statistical test, which are both exact. 
We also proposed a correction strategy that allows us to investigate the possible causes 
of the observed correlations. 

Using this method, we could identify copy number dependent and copy number inde¬ 
pendent correlated regions of expression. Copy number dependent regions correspond 
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Figure 8: Expression (left) and methylation (right) data from Region 17-7. The order¬ 
ing of the patients (x-axis) is kept the same in the two plots. 



Figure 9: Correlations between expression and methylation data from Region 17-7. 
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to genomic alterations; copy number independent regions could be due to different 
mechanisms, including epigenetic mechanism. We showed, for one region, which is 
part of the HOXB complex, that there is negative correlation between expression and 
DNA methylation. The detected regions should be further investigated to better under¬ 
stand the underlying mechanism. 

While the expression data used here were acquired using the microarray technology, 
any other technology, including RNA-seq, can be used as well. 

In our analysis, we have assumed stretches of correlated contiguous neighboring 
genes. This is obviously a simplification. Within a correlated region, a gene (or a 
few genes) could exhibit a weak or even a negative correlation with the other genes. 
This could occur for different reasons: the gene can be not expressed; alternatively, 
the gene could be non affected by the regulation process that impacts the other ones; 
finally, the gene could be impacted in a opposite way compared with the other ones. 
Note that genes that exhibit no expression or no variation in the dataset can be detected 
and could be discarded before applying the analysis. While this preprocessing was 
not required in the present study, running the analysis without removing non-expressed 
genes would lower the performance of any method aimed at finding correlated (and 
reasonably homogeneous) regions. Alternatively, looking for the effect of adding a 
variable number of uncorrelated genes in correlated regions is an obvious follow-up of 
the present work. 

The proposed correction strategy could easily be generalized to more than one sig¬ 
nal to correct for, as it does not rely on a joint modeling of all types of data at hand. 
Furthermore the segmentation used in the correction step enables us to deal with sig¬ 
nals with different probe densities. Finally, this correction approach allowed us to keep 
all tumors in the study, as opposed to (Stransky et al., 2006} were tumors with CNV in 
a given region were excluded when analysing this region. 

Also, prior information on genes or regions could be accounted for in the segmentation 
step. Indeed, the likelihood £(r, t') associated with a given region can be reweighted 
or penalized, the dynamic programming algorithm then applies with the same compu¬ 
tational complexity. 
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Appendix 

A Proof of Lemma (|TJ) 

Throughout this proof, we drop index k for sake of clarity. For a region with length i, 
the covariance matrix E in Equation ([TJ can be rewritten as: 

E = (1 -p)l + pj 
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where I stand for the t x i identity matrix and J for the t / l matrix with all entries 
equal to one. The inverse of this matrix has the form E _1 = al + bJ where 


1 ~P (1-P)[1-P + ^P]‘ 


1 z, P 

-, b = —-—— 


The determinant of E is 


m = (i- p y- i (i-p+ip) 

and the trace term tr [yE _1 (F) T ] in equation 0 yields 


e 


t 


tr (Y(al + 6 J)y T ) = an£ + bn EE<V 


j =i *:=i 


Combining all the above gives the log-likelihood for this region: 

-21og£ = n [log (1 — p + ip) + (£ - 1) log(l — p)\ 

ni _ pnYfjYfGjk 

+ i -p 0-- p)[i- p + tp]' 

Optimizing this function wrt p gives the formula of the MLE 0- Pluging this estimate 
into the same function gives the contrast function given in ([8]). 

B Distribution of the test statistic 

Note Y- k> = {Y i>Th _ 1 + i,Yi, Tk ) T . Using the same notations as in Section 3 one has 



Because variables Y- k \ i = 1 ,n are i.i.d so do variables Y- k> , and consequently 
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